A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform thousands of small BLAS operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current batched BLAS proposals and perform a number of experiments, focusing on GEMM, to explore their effect on performance. In particular, we analyze the effect of novel data layouts which, for example, interleave the matrices in memory to aid vectorization and prefetching of data. Utilizing these modifications, our code outperforms both MKL and cuBLAS by up to 6 times on the self-hosted Intel KNL (codenamed Knights Landing) and Kepler GPU architectures, for large numbers of DGEMM operations using matrices of size 2 × 2 to 20 × 20.
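The interleaving mentioned above can be sketched as follows: instead of storing each small matrix contiguously, the same element of every matrix in the batch is stored contiguously, so the innermost loop of a batched GEMM can run over the batch index with unit stride. This is a minimal illustration of the idea, not the paper's implementation; the index formulas and function names here are our own assumptions.

```c
#include <stddef.h>

/* Batch of `nb` column-major m x n matrices.
   Block layout:       matrix k is contiguous  -> idx = k*m*n + j*m + i
   Interleaved layout: element (i,j) of all k  -> idx = (j*m + i)*nb + k
   In the interleaved form, a loop over k touches consecutive memory,
   which aids vectorization and prefetching. */
static inline size_t blk(size_t k, size_t i, size_t j,
                         size_t m, size_t n, size_t nb) {
    (void)nb; return k*m*n + j*m + i;
}
static inline size_t ilv(size_t k, size_t i, size_t j,
                         size_t m, size_t n, size_t nb) {
    (void)n; return (j*m + i)*nb + k;
}

/* C = A*B for every matrix in the batch (A: m x p, B: p x n, C: m x n),
   all stored interleaved. The innermost loop runs over the batch index k,
   so each load/store is unit-stride across the batch. */
void batched_gemm_interleaved(size_t m, size_t n, size_t p, size_t nb,
                              const double *A, const double *B, double *C) {
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            for (size_t k = 0; k < nb; ++k)
                C[ilv(k, i, j, m, n, nb)] = 0.0;
            for (size_t l = 0; l < p; ++l)
                for (size_t k = 0; k < nb; ++k)
                    C[ilv(k, i, j, m, n, nb)] +=
                        A[ilv(k, i, l, m, p, nb)] * B[ilv(k, l, j, p, n, nb)];
        }
}
```

For the tiny matrices targeted here (2 × 2 to 20 × 20), a whole matrix occupies only a few cache lines, so interleaving trades per-matrix locality for batch-direction vectorization.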